Parquet/async: Default to suffix requests on supporting readers/object stores #6157

H-Plus-Time · 2024-07-30T09:40:21Z

Which issue does this PR close?

Closes #5979 .

Rationale for this change

Described in #5979, but more or less the point is to make parquet::async::AsyncFileReader agnostic toward the size of the file (when suffix requests are supported, notably excluding Azure blob storage).

What changes are included in this PR?

get_bytes range parameter type change -> GetRange
MetadataLoader::load file_size parameter removed (file_size aware load stubbed out as load_absolute, pending a better name)
fetch_parquet_metadata file_size parameter shifted to an Option
ParquetObjectReader ObjectMeta swapped for object_store::Path (likely ObjectMeta will be re-added as an optional property provided via the builder interface)

Are there any user-facing changes?

More or less all of the above.

…requests, swap get_bytes to take a GetRange (equivalent to object-store's)

H-Plus-Time · 2024-07-30T10:05:28Z

Aspects that I suspect are just bad:

GetRange - not exactly a fitting name 🤷
get_byte_ranges - no longer Vec wrt get_bytes. Intended to avoid dealing with throwing out GetRange::Suffix before passing into ObjectStore::get_ranges / reimplementing that.

Aspects that need consensus:

which of the two options (suffix, or non-suffix) gets the load method name (i.e. which is the default), and what the non-default method name should be.
How explicit should switching on (or if it's the default, off) suffix requests be?

kylebarron

Thanks for starting this @H-Plus-Time!

kylebarron · 2024-07-30T15:00:08Z

parquet/src/arrow/async_reader/mod.rs

+                }
+            };
+            // TODO: figure out a better alternative for Offset ranges
+            let mut buffer = Vec::with_capacity(to_read.unwrap_or(1_024usize));


Suggested change

let mut buffer = Vec::with_capacity(to_read.unwrap_or(1_024usize));

let mut buffer = Vec::with_capacity(to_read.unwrap_or(1024));

The usize type here can probably be inferred?

Indeed it can, oops

One thing that's worth checking - is there ever a situation where it's sensible/vaguely useful for a parquet reader to do an unbounded read (the GetRange::Offset variant)?

I would think you'd always have a known upper bound, since knowing where to start a read (e.g. read all row groups starting from row group N) implies you have the FileMetadata already.

…mple

kylebarron · 2024-08-21T01:52:09Z

@H-Plus-Time do you need any help with this? Or is this at a state where it needs API decisions from maintainers?

H-Plus-Time · 2024-08-21T03:39:33Z

@H-Plus-Time do you need any help with this? Or is this at a state where it needs API decisions from maintainers?

Yep, the main one I suppose is whether Offset ranges are ever used in parquet readers - if they're not, there's a few variants and optionals we can throw out.

Other than that, naming and the get_bytes vs get_byte_ranges deviations.

kylebarron · 2024-08-21T03:57:52Z

Maybe it's worth taking this out of draft mode then?

Xuanwo · 2024-08-21T04:42:08Z

parquet/src/arrow/async_reader/metadata.rs

@@ -52,7 +51,44 @@ impl<F: MetadataFetch> MetadataLoader<F> {
    /// Create a new [`MetadataLoader`] by reading the footer information
    ///
    /// See [`fetch_parquet_metadata`] for the meaning of the individual parameters
-    pub async fn load(mut fetch: F, file_size: usize, prefetch: Option<usize>) -> Result<Self> {
+    pub async fn load(mut fetch: F, prefetch: Option<usize>) -> Result<Self> {
+        let suffix = fetch.fetch(GetRange::Suffix(prefetch.unwrap_or(8))).await?;


Hi, thanks a lot for this effort. However, using suffix fetch by default may break users on storage services that don't support this, like azblob. Have you considered not changing load, but adding a new API called load_without_size, load_with_suffix_fetch, or something similar?

As described here, I believe the default implementation for azure is to make two requests: one for the the length and another for the suffix data. But then this is much more performant on all other platforms

As described here, I believe the default implementation for azure is to make two requests: one for the the length and another for the suffix data. But then this is much more performant on all other platforms

Hi, I disagree with this because, in most cases, we already have the file size from ListObjects or other metadata services. So, this change doesn't perform better on other platforms and has a negative effect on azblob.

Would you reconsider the choice by adding a new function instead of changing the existing one? This would also allow us to include it in the next minor version, avoiding a breaking change.

I don't have a preference myself whether the default load uses a suffix request or not. As @H-Plus-Time noted above, we're looking for consensus on this.

Aspects that need consensus:

which of the two options (suffix, or non-suffix) gets the load method name (i.e. which is the default), and what the non-default method name should be.

Separately,

in most cases, we already have the file size from ListObjects or other metadata services

this seems to depend heavily on your use case. In my case I rarely have this information already.

include it in the next minor version, avoiding a breaking change

This is moot anyways, because the next release is breaking, right?

This is moot anyways, because the next release is breaking, right?

Thanks for pointing out this, let's ignore this part.

tustvold · 2024-08-24T15:41:16Z

I'm pretty apprehensive about changing the default behaviour to use suffix requests given there are stores that are known to not support them.

I wonder what you think of instead performing the initial suffix read of the prefetch size manually, and then constructing an AsyncFileReader with these bytes cached? This initial fetch request would tell you the total size of the object and so no changes would be necessary to traits, allowing us to keep the current simple interface and avoid potential compatibility footguns, but also allow you to avoid making additional HEAD requests for your use case?

H-Plus-Time · 2024-08-26T01:46:36Z

I'm pretty apprehensive about changing the default behaviour to use suffix requests given there are stores that are known to not support them.

I wonder what you think of instead performing the initial suffix read of the prefetch size manually, and then constructing an AsyncFileReader with these bytes cached? This initial fetch request would tell you the total size of the object and so no changes would be necessary to traits, allowing us to keep the current simple interface and avoid potential compatibility footguns, but also allow you to avoid making additional HEAD requests for your use case?

I agree with defaulting to non-suffix requests (consensus seems to be reached on that).

The manual approach isn't an option in my main use-case (browser side fetch as the underlying IO method) - usually resources are cross-origin, and Content-Range is not CORS-safelisted (the target origin must include it in Access-Control-Expose-Headers), so you can't determine the object's size from a ranged GET request alone without considerable intervention.

Shifting the suffix request path to a secondary load_without_size, and probably re-emphasising the file_size parameter (that is, make it an explicit choice on the users' part to not give the reader that information), I think that would be the least disruptive change.

Perhaps the following would be sufficient:

/// store.rs
impl ParquetObjectReader {
  pub fn new(store: Arc<dyn ObjectStore>, meta: ObjectMeta) -> Self
  pub fn new_without_size(store: Arc<dyn ObjectStore>, location: Path) -> Self
}

/// metadata.rs

impl<F: MetadataFetch> MetadataLoader<F> {
  pub async fn load(mut fetch: F, file_size: usize, prefetch: Option<usize>) -> Result<Self>
  pub async fn load_without_size(mut fetch: F, prefetch: Option<usize>) -> Result<Self>
}

// fetch_parquet_metadata elided for brevity - file_size: usize -> Option<usize>

With the above, the remaining footguns (calling fetch_parquet_metadata with file_size=None, calling load_without_size, or supplying a suffix range directly to get_bytes) would require quite a lot of intent.

tustvold · 2024-08-26T07:43:38Z

Can you not add the header to https://docs.aws.amazon.com/AmazonS3/latest/userguide/ManageCorsUsing.html#cors-expose-headers. Given you have to configure CORS for anything to work, it seems requiring an additional header isn't that burdensome?

H-Plus-Time · 2024-08-26T08:25:06Z

These are typically origins not controlled by the requester, configured with boilerplate CORS - unlikely to ever change.

tustvold · 2024-08-26T09:08:03Z

So long as we don't break existing workloads, I don't see an issue. However, I will observe that the code commonality between a suffix and range based MetadataLoader will likely be relatively small... Rather than adding code to support what is a fairly niche use-case, it might be worth this living in a repository dedicated to WASM support? Given the first-party S3 implementation doesn't support WASM32, let alone CORS-filtered responses that are missing expected headers, I suspect you must be rolling your own anyway?

kylebarron · 2024-08-26T12:38:09Z

I don't quite understand why non-WASM APIs wouldn't want first-class suffix-based loading support as well? I'm building APIs for both WASM/JS and Python, and in Python it seems a suffix-based loading approach would always be preferable when the user passes in a string URL or object store path (whenever the API doesn't already have the content length).

For all object stores that support suffix requests, we save one request by getting the content length and metadata length at the same time.
In the Azure case that doesn't support suffix requests, the ObjectStore fallback does a head then range, which is no worse than now whenever the ObjectMeta is not already known.

tustvold · 2024-08-26T13:17:12Z

I don't quite understand why non-WASM APIs wouldn't want first-class suffix-based loading support as well

This is already supported as I articulated above - #6157 (comment), WASM is a particular unique case where CORS enforcement strips the Content-Range header from the response

H-Plus-Time · 2024-08-27T10:01:00Z

Downstream consumers in Rust or Rust+PyO3 using the cache-populating approach you (@tustvold) suggested would still need to roll their own cache, right?

As for the wasm32 usecase - yes, the object_store implementation(/s) are 3rd party, but not the parquet machinery. The volume of code required to do what this PR proposes externally is significant - a full clone of ParquetObjectReader, more or less full reimplementation of the MetadataLoader (due to a combination of remainder being (correctly) private, and the bumpalloc-like nature of wasm's linear memory).

tustvold · 2024-08-27T10:18:34Z

Right, I'll try to find some time to work out a better way to proceed here over the next few days. It seems unfortunate to add suffix fetches to AsynFileReader when they're only applicable to metadata fetching, which is already a separate method

Edit: see also #6002 for some other discussion on parquet metadata

alamb · 2024-09-18T20:12:17Z

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

alamb · 2024-10-01T19:44:14Z

Marking as draft as we don't think this one is waiting for feedback anymore

kylebarron · 2024-10-01T19:47:07Z

The latest update was from @tustvold:

Right, I'll try to find some time to work out a better way to proceed here over the next few days.

I'm not sure if there's other work to do before a re-review of this PR.

It may be that some of this PR has been superseded by all the progress on the ParquetMetaDataReader?

etseidl · 2024-10-01T20:33:52Z

It may be that some of this PR has been superseded by all the progress on the ParquetMetaDataReader?

sadly not :( I deferred the suffix reading, hoping for a solution here first. ParquetMetaDataReader still needs the file size to do async reads.

tustvold · 2024-10-06T22:04:03Z

I'm afraid I've not had time to look into this nor to follow the recent changes to parquet metadata loading. I suspect what needs to happen is for someone to find a way to incorporate this into the new ParquetMetaDataReader, but I'll confess to not really knowing what that might look like.

Ideally we can find a way to share code effectively, whilst also not altering the current behaviour. I would be less excited about an approach that simply creates an entirely new codepath

Parquet/async: Swap default MetadataLoader::load behaviour to suffix …

c3e8e44

…requests, swap get_bytes to take a GetRange (equivalent to object-store's)

github-actions bot added the parquet Changes to the parquet crate label Jul 30, 2024

H-Plus-Time changed the base branch from master to 53.0.0-dev July 30, 2024 09:40

H-Plus-Time changed the base branch from 53.0.0-dev to master July 30, 2024 09:41

H-Plus-Time added 2 commits July 30, 2024 19:42

cargo fmt

73cd820

Fix read_with_rowgroup test

b671545

kylebarron reviewed Jul 30, 2024

View reviewed changes

tustvold added the api-change Changes to the arrow API label Jul 31, 2024

Add tests, implementation for MetadataLoader::load_absolute

76a4633

H-Plus-Time force-pushed the parquet-metadata-suffix-requests branch 2 times, most recently from 9e573ab to 059b4e4 Compare August 1, 2024 10:46

Address MetadataLoader clippy errors, correct docs + add a suffix exa…

e9c2800

…mple

H-Plus-Time force-pushed the parquet-metadata-suffix-requests branch from 059b4e4 to e9c2800 Compare August 1, 2024 10:54

H-Plus-Time marked this pull request as ready for review August 21, 2024 03:58

Xuanwo reviewed Aug 21, 2024

View reviewed changes

H-Plus-Time added 2 commits August 27, 2024 14:54

Swap precedence of file size agnostic and aware load methods

5f44af4

cargo fmt

3862419

tustvold mentioned this pull request Aug 27, 2024

API for encoding/decoding ParquetMetadata with more control #6002

Closed

kylebarron mentioned this pull request Aug 31, 2024

Prepare arrow/parquet 53.0.0 release #6338

Merged

Xuanwo mentioned this pull request Sep 18, 2024

POC: Add ParquetMetaDataReader #6392

Closed

etseidl mentioned this pull request Sep 24, 2024

Deprecate and replace old Parquet metadata parsing functions #6447

Open

6 tasks

alamb marked this pull request as draft October 1, 2024 19:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parquet/async: Default to suffix requests on supporting readers/object stores #6157

Parquet/async: Default to suffix requests on supporting readers/object stores #6157

H-Plus-Time commented Jul 30, 2024

H-Plus-Time commented Jul 30, 2024 •

edited

Loading

kylebarron left a comment

kylebarron Jul 30, 2024

H-Plus-Time Jul 31, 2024

H-Plus-Time Aug 1, 2024

kylebarron commented Aug 21, 2024

H-Plus-Time commented Aug 21, 2024

kylebarron commented Aug 21, 2024

Xuanwo Aug 21, 2024

kylebarron Aug 21, 2024 •

edited

Loading

Xuanwo Aug 23, 2024

kylebarron Aug 23, 2024

Xuanwo Aug 23, 2024

tustvold commented Aug 24, 2024 •

edited

Loading

H-Plus-Time commented Aug 26, 2024 •

edited

Loading

tustvold commented Aug 26, 2024

H-Plus-Time commented Aug 26, 2024

tustvold commented Aug 26, 2024

kylebarron commented Aug 26, 2024 •

edited

Loading

tustvold commented Aug 26, 2024 •

edited

Loading

H-Plus-Time commented Aug 27, 2024

tustvold commented Aug 27, 2024 •

edited

Loading

alamb commented Sep 18, 2024

alamb commented Oct 1, 2024

kylebarron commented Oct 1, 2024

etseidl commented Oct 1, 2024

tustvold commented Oct 6, 2024

	let mut buffer = Vec::with_capacity(to_read.unwrap_or(1_024usize));
	let mut buffer = Vec::with_capacity(to_read.unwrap_or(1024));

Parquet/async: Default to suffix requests on supporting readers/object stores #6157

Are you sure you want to change the base?

Parquet/async: Default to suffix requests on supporting readers/object stores #6157

Conversation

H-Plus-Time commented Jul 30, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

H-Plus-Time commented Jul 30, 2024 • edited Loading

kylebarron left a comment

Choose a reason for hiding this comment

kylebarron Jul 30, 2024

Choose a reason for hiding this comment

H-Plus-Time Jul 31, 2024

Choose a reason for hiding this comment

H-Plus-Time Aug 1, 2024

Choose a reason for hiding this comment

kylebarron commented Aug 21, 2024

H-Plus-Time commented Aug 21, 2024

kylebarron commented Aug 21, 2024

Xuanwo Aug 21, 2024

Choose a reason for hiding this comment

kylebarron Aug 21, 2024 • edited Loading

Choose a reason for hiding this comment

Xuanwo Aug 23, 2024

Choose a reason for hiding this comment

kylebarron Aug 23, 2024

Choose a reason for hiding this comment

Xuanwo Aug 23, 2024

Choose a reason for hiding this comment

tustvold commented Aug 24, 2024 • edited Loading

H-Plus-Time commented Aug 26, 2024 • edited Loading

tustvold commented Aug 26, 2024

H-Plus-Time commented Aug 26, 2024

tustvold commented Aug 26, 2024

kylebarron commented Aug 26, 2024 • edited Loading

tustvold commented Aug 26, 2024 • edited Loading

H-Plus-Time commented Aug 27, 2024

tustvold commented Aug 27, 2024 • edited Loading

alamb commented Sep 18, 2024

alamb commented Oct 1, 2024

kylebarron commented Oct 1, 2024

etseidl commented Oct 1, 2024

tustvold commented Oct 6, 2024

H-Plus-Time commented Jul 30, 2024 •

edited

Loading

kylebarron Aug 21, 2024 •

edited

Loading

tustvold commented Aug 24, 2024 •

edited

Loading

H-Plus-Time commented Aug 26, 2024 •

edited

Loading

kylebarron commented Aug 26, 2024 •

edited

Loading

tustvold commented Aug 26, 2024 •

edited

Loading

tustvold commented Aug 27, 2024 •

edited

Loading